The Department of Defense and Department of Homeland Security use
many threat detection systems, such as air cargo screeners and
counter-improvised-explosive-device systems. Threat detection systems that
perform well during testing are not always well received by the system
operators, however. Some systems may frequently “cry wolf,” generating false
alarms when true threats are not present. As a result, operators lose faith in
the systems, ignoring them or even turning them off and taking the chance
that a true threat will not appear. This article reviews statistical concepts
to reconcile the performance metrics that summarize a developer’s view of
a system during testing with the metrics that describe an operator’s view of
the system during real-world missions. Program managers can still make
use of systems that “cry wolf” by arranging them into a tiered system that,
overall, exhibits better performance than each individual system alone.
The Department of Defense (DoD) and Department of Homeland
Security  (DHS)  operate  many  threat  detection  systems.  Examples  include 
counter-mine  and  counter-improvised-explosive-device  (IED)  systems 
and  airplane  cargo  screening  systems  (Daniels,  2006;  L3  Communications 
Cyterra,  2012;  L3  Communications,  Security  &  Detection  Systems,  2011, 
2013, 2014;  Niitek,  n.d.;  Transportation  Security  Administration,  2013;  U.S. 
Army,  n.d.;  Wilson,  Gader,  Lee,  Frigui,  &  Ho,  2007).  All  of  these  systems 
share  a  common  purpose:  to  detect  threats  among  clutter. 

Threat detection systems are often assessed based on their Probability of
Detection (Pd) and Probability of False Alarm (Pfa). Pd describes the fraction
of true threats for which the system correctly declares an alarm. Conversely,
Pfa describes the fraction of true clutter (true non-threats) for which the
system incorrectly declares an alarm, that is, a false alarm. A perfect system
will exhibit a Pd of 1 and a Pfa of 0. Pd and Pfa are summarized in Table 1 and
discussed in Urkowitz (1967).


TABLE 1. DEFINITIONS OF COMMON METRICS USED TO ASSESS PERFORMANCE OF THREAT DETECTION SYSTEMS

| Metric | Definition | Perspective |
|---|---|---|
| Probability of Detection (Pd) | The fraction of all items containing a true threat for which the system correctly declared an alarm | Developer |
| Probability of False Alarm (Pfa) | The fraction of all items not containing a true threat for which the system incorrectly declared an alarm | Developer |
| Positive Predictive Value (PPV) | The fraction of all items causing an alarm that did end up containing a true threat | Operator |
| Negative Predictive Value (NPV) | The fraction of all items not causing an alarm that did end up not containing a true threat | Operator |
| Prevalence (Prev) | The fraction of items that contained a true threat (regardless of whether the system declared an alarm) | - |
| False Alarm Rate (FAR) | The number of false alarms per unit time, area, or distance | - |

Threat detection systems with good Pd and Pfa performance metrics are
not always well received by the system’s operators, however. Some systems
may frequently “cry wolf,” generating false alarms when true threats are not
present. As a result, operators may lose faith in the systems, delaying their
response to alarms (Getty, Swets, Pickett, & Gonthier, 1995) or ignoring
them altogether (Bliss, Gilson, & Deaton, 1995), potentially leading to
disastrous consequences. This issue has arisen in military, national security,
and civilian scenarios.


The  New  York  Times  described  a  1987  military  incident  involving  the  threat 
detection  system  installed  on  a  $300  million  high-tech  warship  to  track 
radar signals in the waters and airspace off Bahrain. Unfortunately,
“somebody had turned off the audible alarm because its frequent beeps bothered
him”  (Cushman,  1987,  p.  1).  The  radar  operator  was  looking  away  when  the 
system  flashed  a  sign  alerting  the  presence  of  an  incoming  Iraqi  jet.  The 
attack  killed  37  sailors. 

That  same  year,  The  New  York  Times  reported  a  similar  civilian  incident 
in  the  United  States.  An  Amtrak  train  collided  near  Baltimore,  Maryland, 
killing  15  people  and  injuring  176.  Investigators  found  that  an  alarm  whistle 
in the locomotive cab had been “substantially disabled by wrapping it with
tape”  and  “train  crew  members  sometimes  muffle  the  warning  whistle 
because  the  sound  is  annoying”  (Stuart,  1987,  p.  1). 

Such  incidents  continued  to  occur  two  decades  later.  In  2006,  The  Los  Angeles 
Times  described  an  incident  in  which  a  radar  air  traffic  control  system  at 
Los  Angeles  International  Airport  (LAX)  issued  a  false  alarm,  prompting 
the  human  controllers  to  “turn  off  the  equipment’s  aural  alert”  (Oldham, 
2006,  p.  2).  Two  days  later,  a  turboprop  plane  taking  off  from  the  airport 
narrowly  missed  a  regional  jet,  the  “closest  call  on  the  ground  at  LAX”  in  2 
years  (Oldham,  2006,  p.  2).  This  incident  had  homeland  security  implications, 
since  DHS  and  the  Department  of  Transportation  are  co-sector-specific 
agencies  for  the  Transportation  Systems  Sector,  which  governs  air  traffic 
control  (DHS,  2016). 


The disabling of threat detection systems due to false alarms is troubling.
This behavior often arises from an inappropriate choice of metrics used to
assess the system’s performance during testing. While Pd and Pfa encapsulate
the developer’s perspective of the system’s performance, these metrics
do not encapsulate the operator’s perspective. The operator’s view can be
better summarized with other metrics, namely Positive Predictive Value
(PPV) and Negative Predictive Value (NPV). PPV describes the fraction of all
alarms that correctly turn out to be true threats, a measure of how often the
system does not “cry wolf.” Similarly, NPV describes the fraction of all
non-alarms that correctly turn out to be true clutter. From the operator’s
perspective, a perfect system will have PPV and NPV values equal to 1. PPV and
NPV are summarized in Table 1 and discussed in Altman and Bland (1994b).

Interestingly enough, the very same threat detection system that satisfies
the developer’s desire to detect as much truth as possible can also disappoint
the operator by generating false alarms, or “crying wolf,” too often
(Scheaffer & McClave, 1995). A system can exhibit excellent Pd and Pfa values
while also exhibiting a poor PPV value.
Unfortunately,  low  PPV  values  naturally  occur  when  the  Prevalence  (Prev) 
of  true  threat  among  true  clutter  is  extremely  low  (Parasuraman,  1997; 
Scheaffer  &  McClave,  1995),  as  is  often  the  case  in  defense  and  homeland 
security  scenarios.  As  summarized  in  Table  1,  Prev  is  a  measure  of  how 
widespread  or  common  the  true  threat  is.  A  Prev  of  1  indicates  a  true  threat 
is  always  present,  while  a  Prev  of  0  indicates  a  true  threat  is  never  present. 
As  will  be  shown,  a  low  Prev  can  lead  to  a  discrepancy  in  how  developers 
and  operators  view  the  performance  of  threat  detection  systems  in  the  DoD 
and  DHS. 

In this article, the author reconciles the performance metrics used to
quantify the developer’s versus operator’s views of threat detection systems.
Although these concepts are already well known within the statistics and
human factors communities, they are not often immediately understood in
the DoD and DHS science and technology (S&T) acquisition communities.
This review is intended for program managers (PM) of threat detection
systems in the DoD and DHS. This article demonstrates how to calculate Pd,
Pfa, PPV, and NPV using a notional air cargo screening system as an example.
Then it illustrates how a PM can still make use of a system that frequently
“cries wolf” by incorporating it into a tiered system that, overall, exhibits
better performance than each individual system alone. Finally, the author
cautions that Pfa and NPV can be calculated only for threat classification
systems, rather than genuine threat detection systems. False Alarm Rate
is often calculated in place of Pfa.


Testing  a  Threat  Detection  System 

A notional air cargo screening system illustrates the discussion of
performance metrics for threat detection systems. As illustrated by Figure 1,
the purpose of this notional system is to detect explosive threats packed inside
items that are about to be loaded into the cargo hold of an airplane. To
determine how well this system meets capability requirements, its performance
must be quantified. A large number of items is input into the system, and each
item’s ground truth (whether the item contained a true threat) is compared
to the system’s output (whether the system declared an alarm). The items are
representative of the items that the system would likely encounter in an
operational setting. At the end of the test, the True Positive (TP), False
Positive (FP), False Negative (FN), and True Negative (TN) items are counted.
Figure 2 tallies these counts in a 2 x 2 confusion matrix:
•  A TP is an item that contained a true threat, and for which the
system  correctly  declared  an  alarm. 

•  An  FP  is  an  item  that  did  not  contain  a  true  threat,  but  for 
which  the  system  incorrectly  declared  an  alarm— a  false  alarm 
(a  Type  I  error). 

•  An  FN  is  an  item  that  contained  a  true  threat,  but  for  which  the 
system  incorrectly  did  not  declare  an  alarm  (a  Type  II  error). 

•  A  TN  is  an  item  that  did  not  contain  a  true  threat,  and  for  which 
the  system  correctly  did  not  declare  an  alarm. 
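The four categories above can be tallied mechanically once each item's ground truth and the system's output are recorded. The following is a minimal sketch, not part of the original article: the function name `tally_confusion_matrix` and the five-item test log are hypothetical, and each item is assumed to be recorded as a `(has_threat, alarmed)` pair.

```python
from collections import Counter


def tally_confusion_matrix(results):
    """Count TP, FP, FN, and TN items from (has_threat, alarmed) pairs."""
    labels = {
        (True, True): "TP",    # threat present, alarm declared
        (False, True): "FP",   # no threat, alarm declared (Type I error)
        (True, False): "FN",   # threat present, no alarm (Type II error)
        (False, False): "TN",  # no threat, no alarm
    }
    counts = Counter(labels[(bool(truth), bool(alarm))] for truth, alarm in results)
    # Always report all four cells, even when a cell is empty
    return {name: counts.get(name, 0) for name in ("TP", "FP", "FN", "TN")}


# Hypothetical log of five screened items
log = [(True, True), (False, True), (False, False), (True, False), (False, False)]
print(tally_confusion_matrix(log))  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 2}
```

Recording ground truth and system output as separate fields keeps the tally independent of any particular screening system.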


FIGURE  1.  NOTIONAL  AIR  CARGO  SCREENING  SYSTEM 


Note.  A  set  of  predefined,  discrete  items  (small  brown  boxes)  are  presented  to  the  system 
one  at  a  time.  Some  items  contain  a  true  threat  (orange  star)  among  clutter,  while  other 
items  contain  clutter  only  (no  orange  star).  For  each  item,  the  system  declares  either  one 
or  zero  alarms.  All  items  for  which  the  system  declares  an  alarm  (red  exclamation  point) 
are  further  examined  manually  by  trained  personnel  (purple  figure).  In  contrast,  all  items 
for  which  the  system  does  not  declare  an  alarm  (green  checkmark)  are  left  unexamined 
and  loaded  directly  onto  the  airplane. 


As  shown  in  Figure  2,  a  total  of  10,100  items  passed  through  the  notional  air 
cargo  screening  system.  One  hundred  items  contained  a  true  threat  while 
10,000  items  did  not.  The  system  declared  an  alarm  for  590  items  and  did 
not  declare  an  alarm  for  9,510  items.  Comparing  the  items’  ground  truth  to 
the  system’s  alarms  (or  lack  thereof),  there  were  90  TPs,  10  FNs,  500  FPs, 
and  9,500  TNs. 
FIGURE 2. 2 X 2 CONFUSION MATRIX OF NOTIONAL AIR CARGO SCREENING SYSTEM

[Figure: 2 x 2 confusion matrix annotated with the following values.]

The Developer’s View

Probability of Detection: Pd = 90 / (90 + 10) = 0.90 (near 1 is better)

Probability of False Alarm: Pfa = 500 / (500 + 9,500) = 0.05 (near 0 is better)

The Operator’s View

Positive Predictive Value: PPV = 90 / (90 + 500) = 0.15 (near 1 is better)

Negative Predictive Value: NPV = 9,500 / (9,500 + 10) ≈ 1 (near 1 is better)

Note.  The  matrix  tabulates  the  number  of  TP,  FN,  FP,  and  TN  items  processed  by  the 
system.  Pd  and  Pfa  summarize  the  developer’s  view  of  the  system’s  performance  while 
PPV  and  NPV  summarize  the  operator’s  view.  In  this  notional  example,  the  low  PPV  of 
0.15  indicates  a  poor  operator  experience  (the  system  often  generates  false  alarms  and 
“cries  wolf,”  since  only  15%  of  alarms  turn  out  to  be  true  threats)  even  though  the  good  Pd 
and  Pfa  are  well  received  by  developers. 

The  Developer’s  View:  Pd  and  Pfa 

A  PM  must  consider  how  much  of  the  truth  the  threat  detection  system 
is  able  to  identify.  This  can  be  done  by  considering  the  following  questions: 
Of  those  items  that  contain  a  true  threat,  for  what  fraction  does  the  system 
correctly  declare  an  alarm?  And  of  those  items  that  do  not  contain  a  true 
threat,  for  what  fraction  does  the  system  incorrectly  declare  an  alarm— a 
false  alarm?  These  questions  often  guide  developers  during  the  research 
and  development  phase  of  a  threat  detection  system. 

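Pd and Pfa can be easily calculated from the 2 x 2 confusion matrix to answer
these questions. From a developer’s perspective, this notional air cargo
screening system exhibits good¹ performance:

$$P_d = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{90}{90 + 10} = 0.90 \quad \text{(compared to 1 for a perfect system)} \tag{1}$$

$$P_{fa} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = \frac{500}{500 + 9{,}500} = 0.05 \quad \text{(compared to 0 for a perfect system)} \tag{2}$$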
Equation 1 shows that, of all items that contained a true threat (TP + FN
= 90 + 10 = 100), a large subset (TP = 90) correctly caused an alarm. These
counts resulted in Pd = 0.90, close to the value of 1 that would be exhibited
by a perfect system.² Based on this Pd value, the PM can conclude that 90%
of items that contained a true threat correctly caused an alarm, which may
(or may not) be considered acceptable within the capability requirements
for the system. Furthermore, Equation 2 shows that, of all items that did not
contain a true threat (FP + TN = 500 + 9,500 = 10,000), only a small subset
(FP = 500) caused a false alarm. These counts led to Pfa = 0.05, close to the
value of 0 that would be exhibited by a perfect system.³ In other words, only
5% of items that did not contain a true threat caused a false alarm.
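As a quick check of the developer's metrics, the arithmetic can be scripted. This is an illustrative sketch only; the function name `developer_metrics` is an assumption, and the counts are the notional values from Figure 2.

```python
def developer_metrics(tp, fn, fp, tn):
    """Return (Pd, Pfa) computed from confusion-matrix counts."""
    p_d = tp / (tp + fn)   # fraction of true threats that caused an alarm
    p_fa = fp / (fp + tn)  # fraction of true clutter that caused an alarm
    return p_d, p_fa


# Notional air cargo screening counts from Figure 2
p_d, p_fa = developer_metrics(tp=90, fn=10, fp=500, tn=9_500)
print(f"Pd = {p_d:.2f}, Pfa = {p_fa:.2f}")  # Pd = 0.90, Pfa = 0.05
```

Both denominators are taken over ground truth (all threats, all clutter), which is why these metrics do not depend on how rare threats are.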

The  Operator’s  View:  PPV  and  NPV 

The  PM  must  also  anticipate  the  operator’s  view  of  the  threat  detection 
system.  One  way  to  do  this  is  to  answer  the  following  questions:  Of  those 
items  that  caused  an  alarm,  what  fraction  turned  out  to  contain  a  true 
threat  (i.e.,  what  fraction  of  alarms  turned  out  not  to  be  false)?  And  of  those 
items  that  did  not  cause  an  alarm,  what  fraction  turned  out  not  to  contain 
a true threat? On the surface, these questions seem similar to those posed
previously for Pd and Pfa. Upon closer examination, however, they are quite
different. While Pd and Pfa summarize how much of the truth causes an
alarm, PPV and NPV summarize how many alarms turn out to be true.

PPV  and  NPV  can  also  be  easily  calculated  from  the  2x2  confusion  matrix. 
From  an  operator’s  perspective,  the  notional  air  cargo  screening  system 
exhibits  a  conflicting  performance: 


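$$\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}} = \frac{9{,}500}{9{,}500 + 10} \approx 1 \quad \text{(compared to 1 for a perfect system)} \tag{3}$$

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{90}{90 + 500} = 0.15 \quad \text{(compared to 1 for a perfect system)} \tag{4}$$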


Equation 3 shows that, of all items that did not cause an alarm (TN + FN
= 9,500 + 10 = 9,510), a very large subset (TN = 9,500) correctly turned out
to not contain a true threat. These counts resulted in NPV ≈ 1, approximately
equal to the 1 value that would be exhibited by a perfect system.⁴ In
the absence of an alarm, the operator could rest assured that a threat was
highly unlikely. However, Equation 4 shows that, of all items that did indeed
cause an alarm (TP + FP = 90 + 500 = 590), only a small subset (TP = 90)
turned out to contain a true threat (i.e., were not false alarms). These counts
unfortunately led to PPV = 0.15, much lower than the 1 value that would be
exhibited by a perfect system.⁵ When an alarm was declared, the operator
could not trust that a threat was present, since the system generated false
alarms so often.
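The operator's metrics use the same four counts but divide by what the system declared rather than by ground truth. A minimal sketch follows; the function name `operator_metrics` is an assumption, and the counts are again the notional values from Figure 2.

```python
def operator_metrics(tp, fn, fp, tn):
    """Return (PPV, NPV) computed from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # fraction of alarms that were true threats
    npv = tn / (tn + fn)  # fraction of non-alarms that were true clutter
    return ppv, npv


# Same notional counts as the developer's view
ppv, npv = operator_metrics(tp=90, fn=10, fp=500, tn=9_500)
print(f"PPV = {ppv:.2f}, NPV = {npv:.3f}")  # PPV = 0.15, NPV = 0.999
```

Because the 500 false alarms outnumber the 90 true alarms, PPV is low even though only 5% of clutter items alarmed.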


Reconciling Developers with Operators: Pd and Pfa Versus PPV and NPV

The discrepancy between PPV and NPV versus Pd and Pfa reflects the
discrepancy between the operator’s and developer’s views of the threat
detection system. Developers are often primarily interested in how much of
the truth correctly causes alarms, concepts quantified by Pd and Pfa. In
contrast, operators are often primarily concerned with how many alarms turn
out to be true, concepts quantified by PPV and NPV. As shown in Figure 2,
the very same system that exhibits good values for Pd, Pfa, and NPV can also
exhibit poor values for PPV.

Poor  PPV  values  should  not  be  unexpected  for  threat  detection  systems  in 
the  DoD  and  DHS.  Such  performance  is  often  merely  a  reflection  of  the  low 
Prev  of  true  threats  among  true  clutter  that  is  not  uncommon  in  defense  and 
homeland security scenarios.⁶ Prev describes the fraction of all items that
contain  a  true  threat,  including  those  that  did  and  did  not  cause  an  alarm. 
In  the  case  of  the  notional  air  cargo  screening  system,  Prev  is  very  low: 
$$\mathrm{Prev} = \frac{\mathrm{TP} + \mathrm{FN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} = \frac{90 + 10}{90 + 10 + 500 + 9{,}500} = 0.01 \tag{5}$$


Equation 5 shows that, of all items (TP + FN + FP + TN = 90 + 10 + 500 +
9,500 = 10,100), only a very small subset (TP + FN = 90 + 10 = 100) contained
a true threat, leading to Prev = 0.01. When true threats are rare, most alarms
turn out to be false, even for an otherwise strong threat detection system,
leading to a low value for PPV (Altman & Bland, 1994b). In fact, to achieve
a high value of PPV when Prev is extremely low, a threat detection system
must exhibit so few FPs (false alarms) as to make Pfa approximately zero.
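The dependence of PPV on Prev can be made explicit. Rewriting the definitions of Pd, Pfa, and Prev gives PPV = (Pd x Prev) / (Pd x Prev + Pfa x (1 - Prev)), which is Bayes' theorem applied to the confusion matrix. The sketch below is an illustration rather than part of the original article: the function names and the target PPV of 0.90 are assumptions, and it presumes that the Pd and Pfa measured in testing carry over unchanged to the operational setting.

```python
def ppv_from_rates(p_d, p_fa, prev):
    """PPV implied by Pd, Pfa, and Prevalence (Bayes' theorem)."""
    true_alarms = p_d * prev             # expected fraction of items that are TPs
    false_alarms = p_fa * (1.0 - prev)   # expected fraction of items that are FPs
    return true_alarms / (true_alarms + false_alarms)


def pfa_needed_for_ppv(p_d, prev, target_ppv):
    """Largest Pfa that still achieves target_ppv at the given Pd and Prev."""
    return p_d * prev * (1.0 - target_ppv) / (target_ppv * (1.0 - prev))


prev = 100 / 10_100  # Prevalence in the notional example
print(round(ppv_from_rates(0.90, 0.05, prev), 2))      # 0.15
print(round(pfa_needed_for_ppv(0.90, prev, 0.90), 4))  # 0.001
```

Under these assumptions, reaching a hypothetical PPV of 0.90 at this Prev would require a Pfa of roughly 0.001, fifty times lower than the notional system's 0.05.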

Recognizing  this  phenomenon,  PMs  should  not  necessarily  dismiss  a  threat 
detection  system  simply  because  it  exhibits  a  poor  PPV,  provided  that  it 
also exhibits an excellent Pd and Pfa. Instead, PMs can estimate Prev to help
determine  how  to  guide  such  a  system  through  development.  Prev  does  not 
depend  on  the  threat  detection  system  and  can,  in  fact,  be  calculated  in  the 
absence  of  the  system.  Knowledge  of  ground  truth  (which  items  contain  a 
true  threat)  is  all  that  is  needed  to  calculate  Prev  (Scheaffer  &  McClave, 
1995). 

Of  course,  ground  truth  is  not  known  a  priori  in  an  operational  setting. 
However,  it  may  be  possible  for  PMs  to  use  historical  data  or  intelligence 
tips  to  roughly  estimate  whether  Prev  is  likely  to  be  particularly  low  in 
operation.  The  threat  detection  system  can  be  thought  of  as  one  system 
in  a  system  of  systems,  where  other  relevant  systems  are  based  on  record 
keeping  (to  provide  historical  estimates  of  Prev)  or  intelligence  (to  provide 
tips  to  help  estimate  Prev).  These  estimates  of  Prev  can  vary  over  time  and 
location. A Prev that is estimated to be very low can cue the PM to anticipate
discrepancies in Pd and Pfa versus PPV, forecasting the inevitable
discrepancy between the developer’s versus operator’s views early in the
system’s development, while there is still time and opportunity to make
adjustments. At that point, the PM can identify a concept of operations (CONOPS)
in  which  the  system  can  still  provide  value  to  the  operator  for  an  assigned 
mission.  A  tiered  system  may  provide  one  such  opportunity. 
A Tiered System for Threat Detection

Tiered  systems  consist  of  multiple  systems  used  in  series.  The  first 
system  cues  the  use  of  the  second  system  and  so  on.  Tiered  systems  provide 
PMs  the  opportunity  to  leverage  multiple  threat  detection  systems  that, 
individually,  do  not  satisfy  both  developers  and  operators  simultaneously. 
Figure 3 shows two 2 x 2 confusion matrices that represent a notional tiered
system  that  makes  use  of  two  individual  threat  detection  systems.  The  first 
system  (top)  is  relatively  simple  (and  inexpensive)  while  the  second  system 
(bottom)  is  more  complex  (and  expensive).  Other  tiered  systems  can  consist 
of  three  or  more  individual  systems. 


FIGURE 3. NOTIONAL TIERED SYSTEM FOR AIR CARGO SCREENING

[Figure: two 2 x 2 confusion matrices, first system (top) and second system (bottom), annotated with the following values.]

First system:
Pd1 = 90 / (90 + 10) = 0.90
Pfa1 = 500 / (500 + 9,500) = 0.05
PPV1 = 90 / (90 + 500) = 0.15
NPV1 = 9,500 / (9,500 + 10) ≈ 1

Second system:
Pd2 = 88 / (88 + 2) = 0.98
Pfa2 = 20 / (20 + 480) = 0.04
PPV2 = 88 / (88 + 20) = 0.81
NPV2 = 480 / (480 + 2) ≈ 1

Tiered system overall:
PPVoverall = 88 / (88 + 20) = 0.81
NPVoverall = (9,500 + 480) / ((9,500 + 480) + (10 + 2)) ≈ 1


Note.  The  top  2x2  confusion  matrix  represents  the  same  notional  system  described  in 
Figures  1  and  2.  While  this  system  exhibits  good  Pd,  Pfa,  and  NPV  values,  its  PPV  value  is 
poor.  Nevertheless,  this  system  can  be  used  to  cue  a  second  system  to  further  analyze 
the  questionable  items.  The  bottom  matrix  represents  the  second  notional  system.  This 
system  exhibits  a  good  Pd,  Pfa,  and  NPV,  along  with  a  much  better  PPV.  The  second 
system’s  better  PPV  reflects  the  higher  Prev  of  true  threat  encountered  by  the  second 
system,  due  to  the  fact  that  the  first  system  had  already  successfully  screened  out  most 
items  that  did  not  contain  a  true  threat.  Overall,  the  tiered  system  exhibits  a  more  nearly 
optimal  balance  of  Pd,  Pfa,  NPV,  and  PPV  than  either  of  the  two  systems  alone. 
The first system is the notional air cargo screening system discussed
previously. Although this system exhibits good performance from the developer’s
perspective (high Pd and low Pfa), it exhibits conflicting performance from
the operator’s perspective (high NPV but low PPV). Rather than using
this system to classify items as either “Alarm (Threat)” or “No Alarm
(No Threat),” the operator can use this system to screen items as either
“Cue Second System (Maybe Threat)” or “Do Not Cue Second System (No
Threat).” Of the 10,100 items that passed through the first system, 590
were classified as “Cue Second System (Maybe Threat)” while 9,510 were
classified as “No Alarm (No Threat).” The first system’s extremely high
NPV (approximately equal to 1) means that the operator can rest assured that
the lack of a cue correctly indicates the very low likelihood of a true threat.
Therefore, any item that fails to elicit a cue can be loaded onto the airplane,
bypassing the second system and avoiding its unnecessary complexities and
expense.⁷ In contrast, the first system’s low PPV indicates that the operator
cannot trust that a cue indicates a true threat. Any item that elicits a cue
from the first system may or may not contain a true threat and must therefore
pass through the second system for further analysis.


Only  590  items  elicited  a  cue  from  the  first  system  and  passed  through  the 
second  system.  Ninety  items  contained  a  true  threat,  while  500  items  did 
not.  The  second  system  declared  an  alarm  for  108  items  and  did  not  declare 
an  alarm  for  482  items.  Comparing  the  items’  ground  truth  to  the  second 
system’s  alarms  (or  lack  thereof),  there  were  88  TPs,  2  FNs,  20  FPs,  and  480 
TNs. On its own, the second system exhibits a higher Pd and lower Pfa than
the  first  system,  due  to  its  increased  complexity  (and  expense).  In  addition, 
its  PPV  value  is  much  higher.  The  second  system’s  higher  PPV  may  be  due 
to  its  higher  complexity  or  may  simply  be  due  to  the  fact  that  the  second 
system  encounters  a  higher  Prev  of  true  threat  among  true  clutter  than  the 
first  system.  By  the  very  nature  in  which  the  tiered  system  was  assembled, 
the  first  system’s  very  high  NPV  indicates  its  strong  ability  to  screen  out 
most  items  that  do  not  contain  a  true  threat,  leaving  only  those  questionable 
items  for  the  second  system  to  process.  Since  the  second  system  encounters 
only those items that are questionable, it encounters a much higher Prev
and  therefore  has  the  opportunity  to  exhibit  higher  PPV  values.  The  second 
system  simply  has  less  relative  opportunity  to  generate  false  alarms. 


The  utility  of  the  tiered  system  must  be  considered  in  light  of  its  cost.  In 
some  cases,  the  PM  may  decide  that  the  first  system  is  not  needed,  since  the 
second, more complex, system can exhibit the desired Pd, Pfa, PPV, and NPV
values on its own. In that case, the PM may choose to abandon the first
system and pursue a single-tier approach based solely on the second system. In
other  cases,  the  added  complexity  of  the  second  system  may  require  a  large 
increase  in  resources  for  its  operation  and  maintenance.  In  these  cases,  the 
PM  may  opt  for  the  tiered  approach,  in  which  use  of  the  first  system  reduces 
the  number  of  items  that  must  be  processed  by  the  second  system,  reducing 
the  additional  resources  needed  to  operate  and  maintain  the  second  system 
to  a  level  that  may  balance  out  the  increase  in  resources  needed  to  operate 
and  maintain  a  tiered  approach. 

To  consider  the  utility  of  the  tiered  system,  its  performance  as  a  whole  must 
be  assessed,  in  addition  to  the  performance  of  each  of  the  two  individual 
systems that compose it. As with any individual system, Pd, Pfa, PPV, and
NPV  can  be  calculated  for  the  tiered  system  overall.  These  calculations 
must  be  based  on  all  items  encountered  by  the  tiered  system  as  a  whole, 
taking  care  not  to  double  count  those  TP  and  FP  items  from  the  first  tier 
that  pass  to  the  second: 


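$$P_d = \frac{\mathrm{TP}_2}{\mathrm{TP}_2 + (\mathrm{FN}_1 + \mathrm{FN}_2)} = \frac{88}{88 + (10 + 2)} = 0.88 \quad \text{(compared to 1 for a perfect system)} \tag{6}$$

$$P_{fa} = \frac{\mathrm{FP}_2}{\mathrm{FP}_2 + (\mathrm{TN}_1 + \mathrm{TN}_2)} = \frac{20}{20 + (9{,}500 + 480)} \approx 0 \quad \text{(compared to 0 for a perfect system)} \tag{7}$$

$$\mathrm{NPV} = \frac{\mathrm{TN}_1 + \mathrm{TN}_2}{(\mathrm{TN}_1 + \mathrm{TN}_2) + (\mathrm{FN}_1 + \mathrm{FN}_2)} = \frac{9{,}500 + 480}{(9{,}500 + 480) + (10 + 2)} \approx 1 \quad \text{(compared to 1 for a perfect system)} \tag{8}$$

$$\mathrm{PPV} = \frac{\mathrm{TP}_2}{\mathrm{TP}_2 + \mathrm{FP}_2} = \frac{88}{88 + 20} = 0.81 \quad \text{(compared to 1 for a perfect system)} \tag{9}$$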
Overall, the tiered system exhibits good⁸ performance from the developer’s
perspective. Equation 6 shows that, of all items that contained a true threat
(TP2 + (FN1 + FN2) = 88 + (10 + 2) = 100), a large subset (TP2 = 88) correctly
caused an alarm, resulting in an overall value of Pd = 0.88. The PM can
conclude that 88% of items containing a true threat correctly led to a final
alarm from the tiered system as a whole. Although this overall Pd is slightly
lower than the Pd of each of the two individual systems, the overall value
is still close to the value of 1 for a perfect system⁹ and may (or may not) be
considered acceptable within the capability requirements for the envisioned
CONOPS. Similarly, Equation 7 shows that, of all items that did not contain
a true threat (FP2 + (TN1 + TN2) = 20 + (9,500 + 480) = 10,000), only a very
small subset (FP2 = 20) incorrectly caused an alarm, leading to an overall
value of Pfa ≈ 0. Approximately 0% of items not containing a true threat
caused a false alarm.

The tiered system also exhibits good¹⁰ overall performance from the
operator’s perspective. Equation 8 shows that, of all items that did not cause
an alarm ((TN1 + TN2) + (FN1 + FN2) = (9,500 + 480) + (10 + 2) = 9,992), a very
large subset ((TN1 + TN2) = (9,500 + 480) = 9,980) correctly turned out not to
contain a true threat, resulting in an overall value of NPV ≈ 1. The operator
could rest assured that a threat was highly unlikely in the absence of a final
alarm. More interesting, though, is the overall PPV value. Equation 9 shows
that, of all items that did indeed cause a final alarm ((TP2 + FP2) = (88 + 20) =
108), a large subset (TP2 = 88) correctly turned out to contain a true threat;
these alarms were not false. These counts resulted in an overall value of
PPV = 0.81, much closer to the 1 value of a perfect system and much higher
than the PPV of the first system alone.¹¹ When a final alarm was declared,
the operator could trust that a true threat was indeed present since, overall,
the tiered system did not “cry wolf” very often.
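The overall calculations in Equations 6 through 9 can be sketched in a few lines. The function name `tiered_metrics` and the dictionary layout are assumptions made for illustration. The sketch covers only a two-tier arrangement in which every item cued by the first system, and no other item, passes to the second.

```python
def tiered_metrics(first, second):
    """Overall Pd, Pfa, PPV, and NPV for a two-tier system.

    `first` and `second` are dicts of TP, FN, FP, TN counts. The second
    tier is assumed to see exactly the items cued by the first tier.
    """
    if first["TP"] + first["FP"] != sum(second.values()):
        raise ValueError("second tier must process exactly the cued items")
    tp = second["TP"]                # final alarms on true threats
    fp = second["FP"]                # final alarms on clutter
    fn = first["FN"] + second["FN"]  # threats missed at either tier
    tn = first["TN"] + second["TN"]  # clutter cleared at either tier
    return {
        "Pd": tp / (tp + fn),
        "Pfa": fp / (fp + tn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


tier1 = {"TP": 90, "FN": 10, "FP": 500, "TN": 9_500}
tier2 = {"TP": 88, "FN": 2, "FP": 20, "TN": 480}
for name, value in tiered_metrics(tier1, tier2).items():
    print(f"{name} = {value:.3f}")
```

Only the second tier's TPs and FPs count as final alarms, which avoids double counting the first tier's cued items.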

Of course, the PM must compare the overall performance of the tiered system
to capability requirements in order to assess its appropriateness for
the envisioned mission (DoD, 2015; DHS, 2008). The overall values of Pd =
0.88, Pfa ≈ 0, NPV ≈ 1, and PPV = 0.81 may or may not be adequate once these
values are compared to such requirements. Statistical tests can determine
whether the overall values of the tiered system are significantly worse than
required (Fleiss, Levin, & Paik, 2013). Requirements should be set for all
four metrics based on the envisioned mission. Setting requirements for only Pd
and Pfa effectively ignores the operator’s view, while setting requirements for
only PPV and NPV effectively ignores the developer’s view.¹² One may argue that
only the operator’s view (PPV and NPV) must be quantified as capability
requirements. However, there is value in also retaining the developer’s view
(Pd and Pfa), since Pd and Pfa can be useful when comparing and contrasting
the utility of rival systems with similar PPV and NPV values in a particular
mission. Setting the appropriate requirements for a particular mission is a
complex process and is beyond the scope of this article.
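
As one illustration of such a statistical test, the sketch below applies a one-sided exact binomial test to the observed overall Pd. Both the choice of test and the required value of 0.90 are assumptions made for this example; neither is specified in this article, and a PM would substitute the actual capability requirement and a test selected with a statistician.

```python
from math import comb


def binomial_lower_tail(successes, trials, p_required):
    """Return P(X <= successes) for X ~ Binomial(trials, p_required).

    Used here as the one-sided p-value for the null hypothesis that the
    true probability is at least p_required.
    """
    return sum(
        comb(trials, k) * p_required**k * (1 - p_required) ** (trials - k)
        for k in range(successes + 1)
    )


detected = 88        # TP2 from the notional tiered system
threats = 100        # TP2 + FN1 + FN2
required_pd = 0.90   # hypothetical requirement, for illustration only

p_value = binomial_lower_tail(detected, threats, required_pd)
print(f"one-sided p-value = {p_value:.3f}")
```

A small p-value (for example, below 0.05) would suggest that the observed Pd is significantly lower than the requirement; a large p-value means the test provides no such evidence. The same approach could be applied to PPV, using the number of final alarms as the number of trials.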


Threat Detection Versus Threat Classification


Unfortunately, not all four performance metrics can be calculated for
some threat detection systems. In particular, it may be impossible to
calculate Pfa and NPV. This is because the term “threat detection system”
can be a misnomer: it is often used to refer to both threat detection systems
and threat classification systems. Threat classification systems are
those that are presented with a set of predefined, discrete items. The
system’s task is to classify each item as either “Alarm (Threat)” or “No Alarm
(No Threat).” The notional air cargo screening system discussed in this article
is actually an example of a threat classification system, despite the fact
that the author has colloquially referred to it as a threat detection system
throughout the first half of this article. In contrast, genuine threat
detection systems are those that are not presented with a set of predefined,
discrete items. The system’s task is first to detect the discrete items from
a continuous stream of data and then to classify each detected item as either
“Alarm (Threat)” or “No Alarm (No Threat).”
An example of a genuine threat detection system is the notional
counter-IED system illustrated in Figure 4.

FIGURE 4. NOTIONAL COUNTER-IED SYSTEM


Note. Several items are buried in a road often traveled by a U.S. convoy. Some items are
IEDs (orange stars), while others are simply rocks, trash, or other discarded items. The
system continuously collects data while traveling over the road ahead of the convoy
and declares one alarm (red exclamation point) for each location at which it detects a
buried IED. All locations for which the system declares an alarm are further examined
with robotic systems (purple arm) operated remotely by trained personnel. In contrast, all
parts of the road for which the system does not declare an alarm are left unexamined and
are directly traveled over by the convoy.

This issue is more than semantics. Proper labeling of a system’s task helps
to ensure that the appropriate performance metrics are used to assess the
system. In particular, while Pfa and NPV can be used to describe threat
classification systems, they cannot be used to describe genuine threat
detection systems. For example, Equation 2 showed that Pfa depends on FP and
TN counts. While an FP is a true clutter item that incorrectly caused an
alarm, a TN is a true clutter item that correctly did not cause an alarm. FPs
and TNs can be counted for threat classification systems and used to
calculate Pfa as described earlier for the notional air cargo screening system.


This story changes for genuine threat detection systems, however. While
FPs can be counted for genuine threat detection systems, TNs cannot.
Therefore, while Pd and PPV can be calculated for genuine threat detection
systems, Pfa and NPV cannot, since they are based on the TN count. For the
notional counter-IED system, an FP is a location on the road for which a true
IED is not buried but for which the system incorrectly declares an alarm.
Unfortunately, a converse definition for TNs does not make sense: How
should one count the number of locations on the road for which a true IED
is not buried and for which the system correctly does not declare an alarm?
That is, how often should the system get credit for declaring nothing when
nothing was truly there? To answer these TN-related questions, it may be
possible to divide the road into sections and count the number of sections for
which a true IED is not buried and for which the system correctly does not
declare an alarm. However, such a method simply converts the counter-IED
detection problem into a counter-IED classification problem, in which
discrete items (sections of road) are predefined and the system’s task is merely
to classify each item (each section of road) as either “Alarm (IED)” or “No
Alarm (No IED).” This method imposes an artificial definition on the item
(section of road) under classification: How long should each section of road
be? Ten meters long? One meter long? One centimeter long? Such definitions
can be artificial, which simply highlights the fact that the concept of a TN
does not exist for genuine threat detection systems.
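
The arbitrariness of the section length can be made concrete with a short calculation. In the Python sketch below, the road length, the number of buried IEDs, and the number of false alarms are all hypothetical values chosen for illustration, and the code assumes that each IED and each false alarm falls in its own section.

```python
def sectioned_pfa(road_length_m, section_length_m, n_ieds, n_false_alarms):
    """Treat the road as discrete sections and return (TN count, Pfa).

    Assumes every IED and every false alarm occupies a separate section.
    """
    n_sections = round(road_length_m / section_length_m)
    clutter_sections = n_sections - n_ieds        # sections with no IED
    tn = clutter_sections - n_false_alarms        # clutter sections with no alarm
    pfa = n_false_alarms / clutter_sections
    return tn, pfa


# Hypothetical test: 1,000 m of road, 2 buried IEDs, 5 false alarms.
for section_length_m in (10.0, 1.0, 0.01):
    tn, pfa = sectioned_pfa(1_000.0, section_length_m, n_ieds=2, n_false_alarms=5)
    print(f"section = {section_length_m:>5} m  TN = {tn:>6}  Pfa = {pfa:.5f}")
```

The same five false alarms yield a Pfa that differs by roughly three orders of magnitude between the 10-meter and 1-centimeter sections, purely because of the chosen section length.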

Therefore, PMs often rely on an additional performance metric for genuine
threat detection systems: the False Alarm Rate (FAR). FAR can often be
confused with both Pfa and PPV. In fact, documents within the defense and
homeland security communities can erroneously use two or even all three
of these terms interchangeably. In this article, however, FAR refers to the
number of FPs processed per unit time interval, unit geographical area,
or unit distance (depending on whether time, area, or distance is most
salient to the envisioned CONOPS):

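$$\mathrm{FAR} = \frac{\mathrm{FP}}{\text{total time}} \tag{10a}$$

$$\mathrm{FAR} = \frac{\mathrm{FP}}{\text{total area}} \tag{10b}$$

$$\mathrm{FAR} = \frac{\mathrm{FP}}{\text{total distance}} \tag{10c}$$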


For example, Equation 10c shows that one could count the number of FPs
processed per meter as the notional counter-IED system travels down the
road. In that case, FAR would have units of m⁻¹. In contrast, Pd, Pfa, PPV, and
NPV are dimensionless quantities. FAR can be a useful performance metric
in situations for which Pfa cannot be calculated (such as for genuine threat
detection systems) or for which it is prohibitively expensive to conduct a test
to fill out the full 2×2 confusion matrix needed to calculate Pfa.
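
Equations 10a through 10c each amount to a single division, sketched below in Python for the distance-based case. The false alarm count and road length are hypothetical values chosen for illustration, not results from any test, and the function name is an assumption of this example.

```python
def false_alarm_rate(false_positives, total_extent):
    """FAR = FP / total extent, where the extent is a time, area, or distance."""
    if total_extent <= 0:
        raise ValueError("total_extent must be positive")
    return false_positives / total_extent


# Hypothetical run: 5 false alarms declared over 1,000 m of road (Equation 10c).
far_per_m = false_alarm_rate(false_positives=5, total_extent=1_000.0)
print(f"FAR = {far_per_m:.3f} per meter ({far_per_m * 1_000:.1f} per kilometer)")
```

Unlike Pfa, this calculation needs no TN count, so it does not depend on how (or whether) the road is divided into sections. It does carry units, however, so FAR values are comparable only when expressed per the same unit of time, area, or distance.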


Conclusions 

Several metrics can be used to assess the performance of a threat detection
system. Pd and Pfa summarize the developer’s view of the system,
quantifying how much of the truth causes alarms. In contrast, PPV and
NPV summarize the operator’s perspective, quantifying how many alarms
turn out to be true. The same system can exhibit good values for Pd and Pfa
during testing but poor PPV values during operational use. PMs can still
make use of the system as part of a tiered system that, overall, exhibits better
performance than each individual system alone.

Endnotes

1 PMs must determine what constitutes “good” performance. For some
systems operating in some scenarios, Pd = 0.90 is considered “good,” since only
10 FNs out of 100 true threats is considered an acceptable risk. In other cases,
Pd = 0.90 is not acceptable. Appropriately setting a system’s capability requirements
calls for a frank assessment of the likelihood and consequences of FNs versus FPs
and is beyond the scope of this article.

2  Statistical  tests  can  determine  whether  the  system’s  value  is  significantly 
different  from  the  perfect  value  or  the  capability  requirement  (Fleiss,  Levin,  &  Paik, 
2013). 

3  Ibid. 

4  Ibid. 

5  Ibid. 

6 Conversely, when Prev is high, threat detection systems often exhibit poor
values for NPV, even while exhibiting excellent values for Pd, Pfa, and PPV. Such
cases are not discussed in this article, since fewer scenarios in the DoD and DHS
involve a high prevalence of threat among clutter.

7 PMs must decide whether the 10 FNs from the first system are acceptable
with respect to the tiered system’s capability requirements, since the first system’s
FNs will not have the opportunity to pass through the second system and be found.
Setting capability requirements is beyond the scope of this article.

8  PMs  must  determine  what  constitutes  a  “good”  performance  when  setting 
the  capability  requirements  for  the  tiered  system. 

9  Statistical  tests  can  show  which  differences  are  statistically  significant  (Fleiss 
et  al.,  2013),  while  subject  matter  expertise  can  determine  which  differences  are 
operationally  significant. 

10  Once  again,  PMs  must  determine  what  constitutes  a  “good”  performance 
when  setting  the  capability  requirements  for  the  tiered  system. 

11  Once  again,  statistical  tests  can  show  which  differences  are  statistically 
significant  (Fleiss  et  al.,  2013),  while  subject  matter  expertise  can  determine  which 
differences  are  operationally  significant. 

12 All four of these metrics are correlated, since all four metrics depend
on the system’s threshold for alarm. For example, tuning a system to lower its
alarm threshold will increase its Pd at the cost of also increasing its Pfa. Thus,
Pd cannot be considered in the absence of Pfa, and vice versa. To examine this
correlation, Pd and Pfa are often plotted against each other while the system’s alarm
threshold is systematically varied, creating a Receiver-Operating Characteristic
curve (Urkowitz, 1967). Similarly, lowering the system’s alarm threshold will also
affect its PPV. To explore the correlation between Pd and PPV, these metrics
can also be plotted against each other while the system’s alarm threshold is
systematically varied in order to form a Precision-Recall curve (Powers, 2011).
(Note that PPV and Pd are often referred to as Precision and Recall, respectively,
in the information retrieval community [Powers, 2011]. Also, Pd and Pfa are often
referred to as Sensitivity and One Minus Specificity, respectively, in the medical
community [Altman & Bland, 1994a].) Furthermore, although Pd and Pfa do not
depend upon Prev, PPV and NPV do. Therefore, PMs must take Prev into account
when setting and testing system requirements based on PPV and NPV. Such
testing can be done cost-effectively by designing the test to have
an artificial prevalence of 0.5 and then calculating PPV and NPV from the Pd and
Pfa values calculated during the test and the more realistic Prev value estimated for
operational settings (Altman & Bland, 1994b).
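
The calculation described at the end of Endnote 12 follows from Bayes’ rule. The Python sketch below is a minimal illustration; the function name is an assumption of this example, the Pd and Pfa values are the overall values of the notional tiered system, and the operational prevalence of 100 threats among 10,100 items is taken from that same notional scenario.

```python
def ppv_npv(pd, pfa, prev):
    """Compute (PPV, NPV) from Pd, Pfa, and prevalence using Bayes' rule."""
    ppv = (pd * prev) / (pd * prev + pfa * (1 - prev))
    npv = ((1 - pfa) * (1 - prev)) / ((1 - pfa) * (1 - prev) + (1 - pd) * prev)
    return ppv, npv


pd, pfa = 0.88, 0.002   # overall values of the notional tiered system

# The same Pd and Pfa, evaluated at the test prevalence and at the
# lower prevalence expected in operation.
for label, prev in (("test (Prev = 0.5)", 0.5),
                    ("operational (Prev = 100/10,100)", 100 / 10_100)):
    ppv, npv = ppv_npv(pd, pfa, prev)
    print(f"{label}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

At the operational prevalence the sketch recovers the PPV of about 0.81 computed from the raw counts, while at the artificial prevalence of 0.5 the PPV is close to 1. This is why values measured at the test prevalence should not be reported as operational PPV and NPV without the conversion.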